Looking forward to learning some R. I have a decent bit of experience in programming and data science, but I have done almost nothing in R before, so that should be interesting. To be honest, I’m not 100% sure if I’ll have the time to finish the course, but we’ll see. I believe I found the course while browsing WebOodi.
Exercise 2 analysis
library(ggplot2)
library(GGally)
Let’s read the dataset to start.
Let’s look at the dimensions
## [1] 166 7
We can see that the table consists of 166 observations and 7 variables (i.e. 166 rows and 7 columns when viewed as a table).
Now, let’s look at the structure
## 'data.frame': 166 obs. of 7 variables:
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
Both from looking at the data and from the context of the exercise, we can see that each observation encodes a single student: their gender, age, attitude, exam points, and the deep (deep approach), stra (strategic approach) and surf (surface approach) variables, which encode the student’s mean answers to sets of questions meant to measure their approaches to learning. Gender is encoded as a string, age and points as integers, and the remaining variables as floating-point numbers.
summary(learning2014)
## age gender points attitude
## Min. :17.00 Length:166 Min. : 7.00 Min. :1.400
## 1st Qu.:21.00 Class :character 1st Qu.:19.00 1st Qu.:2.600
## Median :22.00 Mode :character Median :23.00 Median :3.200
## Mean :25.51 Mean :22.72 Mean :3.143
## 3rd Qu.:27.00 3rd Qu.:27.75 3rd Qu.:3.700
## Max. :55.00 Max. :33.00 Max. :5.000
## deep surf stra
## Min. :1.583 Min. :1.583 Min. :1.250
## 1st Qu.:3.333 1st Qu.:2.417 1st Qu.:2.625
## Median :3.667 Median :2.833 Median :3.188
## Mean :3.680 Mean :2.787 Mean :3.121
## 3rd Qu.:4.083 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.917 Max. :4.333 Max. :5.000
female <- sum(learning2014$gender == "F")
male <- sum(learning2014$gender == "M")
print(paste("The dataset contains", toString(male), "male and", toString(female), "female students"))
## [1] "The dataset contains 56 male and 110 female students"
From the above we can see the means, medians, ranges (min, max) and quartiles of the variables. For example, the students’ ages range from 17 to 55 with a mean of 25.5 and a median of 22.
110 of the students are female and 56 are male.
Other than age, which skews to the younger side, the variables look more or less normally distributed (except gender, of course, which is a categorical variable). Points also seem to have a small peak at the lower end (i.e. a small group of students got very few points, while the rest seem normally distributed around the middle).
The highest correlation (0.43) is between attitude and points. Surface and deep have a moderate negative correlation (-0.32) as well. There may also be slight correlations between surf and stra, and between surf and attitude.
We also plot data from male and female students separately.
Doesn’t seem to be any major differences except maybe in attitude.
Now, let’s create our regression model. Points will be the dependent variable, and we’ll choose attitude (attitude toward statistics), stra (strategic approach score) and surf (surface approach score) as the independent variables. The exercise didn’t seem to specify which ones we should use, so let’s just go with these.
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
We can see that attitude is the only significant variable in the model (i.e. p < 0.05). The coefficient for attitude is 3.4 (i.e. an increase of one in attitude score predicts an increase of 3.4 in exam points). Let’s get rid of the non-significant variables and fit a new model.
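As a side note, the reduced model doesn’t have to be retyped from scratch: R’s update() can drop terms from a fitted model. A minimal sketch on the built-in mtcars data (not the course data):

```r
# fit a model with several predictors, then drop one term
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# update() refits the same model without qsec
reduced <- update(full, . ~ . - qsec)

length(coef(full))     # 4 coefficients (intercept + 3 predictors)
length(coef(reduced))  # 3 coefficients
```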
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Attitude remains highly significant. With an estimate of ~3.5, the model predicts a 3.5-point increase in exam points when attitude increases by one. With an intercept of 11.6, this means that for someone with an attitude of 0 the model would predict a points score of 11.6, and for someone with an attitude of 5 it would predict a points score of 11.6 + 3.5*5 = 29.1.
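The same prediction arithmetic, written out (using the rounded coefficients quoted above):

```r
intercept <- 11.6
slope <- 3.5

# predicted exam points for a given attitude score
predict_points <- function(attitude) intercept + slope * attitude

predict_points(0)  # 11.6
predict_points(5)  # 29.1
```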
An R-squared of 0.19 means that 19% of the variance of the dependent variable (i.e. exam points) can be explained by the independent variable (attitude).
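As another side note, for a one-predictor regression, R-squared is just the squared correlation between the predictor and the response. A quick check on the built-in mtcars data:

```r
# simple regression on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

r2 <- summary(fit)$r.squared
r2_from_cor <- cor(mtcars$wt, mtcars$mpg)^2

all.equal(r2, r2_from_cor)  # TRUE
```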
Let’s do some plots for further investigation. Let’s start by plotting residuals vs fitted values.
Residual variance seems quite constant across the fitted values (maybe slightly smaller at high points, but that is hard to tell due to the smaller number of observations there), i.e. the assumption of equal variances seems reasonable.
There are a few residuals in the -15 to -20 range, suggesting possible outliers. The Q-Q plot appears quite linear, suggesting that the residuals are normally distributed.
Leverage essentially describes how much of an impact a data point has on the model. Note that a high-leverage point is not necessarily an outlier; it could fit the model very well but have high leverage due to being far away from the other data points. (We can in fact see this above: the highest-leverage point (approx. 0.04) in the data has a fairly small residual.)
Regardless, none of the leverages are particularly large and even the relatively larger ones don’t appear to have particularly large residuals.
In summary, the model assumptions appear valid.
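For reference, these diagnostics can be reproduced for any lm fit; a small sketch on the built-in mtcars data (not the course data):

```r
# fit a simple regression on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

# which = c(1, 2, 5) selects residuals vs fitted, normal Q-Q,
# and residuals vs leverage -- the three plots discussed above
plot(fit, which = c(1, 2, 5))

# leverages can also be inspected directly; for a model with p
# coefficients they always sum to p (here p = 2)
lev <- hatvalues(fit)
sum(lev)
```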
Exercise 3 analysis
Start by reading the dataset and looking up the variable names
alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", header=TRUE, sep = ",")
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset describes students in two Portuguese schools. Each observation describes a single student, including their age, gender, and school performance. For this exercise we’ll be mainly looking at alcohol consumption.
Specifically, we’ll examine the relationship between high alcohol consumption and final student grades (G3), family relationship (famrel), absences, and age.
Intuitively we would expect high alcohol use to correlate with worse grades, possibly worse family relationships, more absences and higher age.
Let’s look at grades first.
ggplot(data = alc, aes(x = G3, fill = high_use)) + geom_bar()
table(highuse = alc$high_use, grades = alc$G3)
## grades
## highuse 0 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## FALSE 33 0 3 13 5 15 18 34 31 19 13 24 28 13 4 11 5 1
## TRUE 6 1 4 2 2 16 9 22 12 11 12 3 4 4 2 2 0 0
ggplot(alc, aes(x = high_use, y = G3)) + geom_boxplot()
Students with the highest grades seem to mostly be in the lower use category.
Let’s look at family support next.
ggplot(data = alc, aes(x = famrel, fill = high_use)) + geom_bar()
table(highuse = alc$high_use, family = alc$famrel)
## family
## highuse 1 2 3 4 5
## FALSE 7 9 40 131 83
## TRUE 2 9 26 52 23
for (i in sort(unique(alc$famrel))) {
sub = alc[alc$famrel==i,]
high = sum(sub$high_use==TRUE)
low = sum(sub$high_use==FALSE)
ratio = round(high/(low + high)*100, digits = 2)
print(paste(toString(ratio), "% of students with a family relationship of", toString(i), "are in the high alcohol category."))
}
## [1] "22.22 % of students with a family relationship of 1 are in the high alcohol category."
## [1] "50 % of students with a family relationship of 2 are in the high alcohol category."
## [1] "39.39 % of students with a family relationship of 3 are in the high alcohol category."
## [1] "28.42 % of students with a family relationship of 4 are in the high alcohol category."
## [1] "21.7 % of students with a family relationship of 5 are in the high alcohol category."
It certainly looks like a good family relationship predicts lower alcohol use (at least when famrel >= 2; the relatively low number of samples in the famrel == 1 category might explain its low ratio of high alcohol consumption).
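The same per-category shares can also be computed without a loop using prop.table on a cross-tabulation. A quick sketch on toy data (not the alc data):

```r
# toy data: a grouping variable and a logical outcome
famrel <- c(1, 1, 2, 2, 2, 3, 3, 3, 3)
high_use <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)

# margin = 2 normalizes within each column (each famrel level)
shares <- prop.table(table(high_use, famrel), margin = 2)

# percentage of high use per famrel level
round(shares["TRUE", ] * 100, 2)
```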
Then absences.
ggplot(data = alc, aes(x = absences, fill = high_use)) + geom_histogram(bins = 20)
ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot()
Hmm… A bit hard to make out, but it looks like those with more absences are more likely to be in the high use category.
Finally, let’s look at age.
ggplot(data = alc, aes(x = age, fill = high_use)) + geom_bar()
table(highuse = alc$high_use, age = alc$age)
## age
## highuse 15 16 17 18 19 20 22
## FALSE 63 79 65 55 7 1 0
## TRUE 18 28 35 26 4 0 1
for (i in sort(unique(alc$age))) {
sub = alc[alc$age==i,]
high = sum(sub$high_use==TRUE)
low = sum(sub$high_use==FALSE)
ratio = round(high/(low + high)*100, digits = 2)
print(paste(toString(ratio), "% of students of the age of", toString(i), "are in the high alcohol category."))
}
## [1] "22.22 % of students of the age of 15 are in the high alcohol category."
## [1] "26.17 % of students of the age of 16 are in the high alcohol category."
## [1] "35 % of students of the age of 17 are in the high alcohol category."
## [1] "32.1 % of students of the age of 18 are in the high alcohol category."
## [1] "36.36 % of students of the age of 19 are in the high alcohol category."
## [1] "0 % of students of the age of 20 are in the high alcohol category."
## [1] "100 % of students of the age of 22 are in the high alcohol category."
It looks like alcohol consumption jumps up somewhat at age 17 and above.
logreg <- glm(high_use ~ G3 + famrel + absences + age, data = alc, family = "binomial")
summary(logreg)
##
## Call:
## glm(formula = high_use ~ G3 + famrel + absences + age, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7108 -0.8076 -0.6780 1.2143 1.9825
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.645322 1.795231 -1.474 0.140609
## G3 -0.009729 0.025802 -0.377 0.706121
## famrel -0.277364 0.122979 -2.255 0.024110 *
## absences 0.062522 0.017618 3.549 0.000387 ***
## age 0.155649 0.102233 1.522 0.127887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 436.02 on 377 degrees of freedom
## AIC: 446.02
##
## Number of Fisher Scoring iterations: 4
Both family relationship and absences have a significant effect. Neither age nor grades has a significant effect.
Let’s look at the odds ratios and their 95% confidence intervals.
exp(cbind(coef(logreg), confint(logreg)))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.07098251 0.002037894 2.3734971
## G3 0.99031793 0.941826103 1.0423755
## famrel 0.75777871 0.594401560 0.9644452
## absences 1.06451794 1.030446671 1.1039063
## age 1.16841640 0.956650325 1.4300500
An increase of one in absences multiplies the odds of a student being in the high alcohol use category by approximately 1.06 (lower and upper confidence boundaries of 1.03 and 1.10). While this might seem low, considering that a student can have a lot of absences this can actually be a relatively large effect. For example, for a student with 20 absences, the model predicts the odds of being in the high alcohol category to be approximately 3.5 times those of a student with no absences (compounding the unrounded estimate: 1.0645^20 ≈ 3.5).
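We can check this compounding directly from the estimated coefficient (copied from the model summary above):

```r
# log-odds coefficient for absences from the fitted model
coef_absences <- 0.062522

# odds ratio for a single additional absence
or_per_absence <- exp(coef_absences)

# odds ratio for 20 absences vs none: per-unit odds ratios compound
or_20 <- or_per_absence^20

round(c(per_absence = or_per_absence, per_20 = or_20), 2)  # ~1.06 and ~3.49
```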
In contrast, a good family relationship lowers the odds of a student being in the high alcohol category. For example, according to the model, the odds of being in the high alcohol category for a student with a family relationship of 5 are approximately 0.76 times those of a student with a family relationship of 4. The upper boundary of the confidence interval is below 1 as well, giving us some confidence that the effect is indeed negative.
In summary, the model supports our hypotheses about absences and family relationships, but not grades or age.
Let’s explore the predictive power of the model
prob <- predict(logreg, type = "response")
alc <- mutate(alc, probability = prob)
alc <- mutate(alc, prediction = prob > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 260 10
## TRUE 98 14
The training error is (98 + 10)/382 = 0.283, with most errors being false negatives. This is certainly better than guessing at 50% probability, but given that only about 30% of students are in the high alcohol use category in the first place, it’s actually not all that much better than just assuming that every student is in the low-consumption group.
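Both numbers can be computed directly from the confusion matrix above:

```r
# confusion matrix from the model (rows: actual, cols: predicted)
cm <- matrix(c(260, 98, 10, 14), nrow = 2,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

n <- sum(cm)  # 382 students
training_error <- (cm["FALSE", "TRUE"] + cm["TRUE", "FALSE"]) / n
round(training_error, 3)  # 0.283

# baseline: always predict the majority class (low use)
baseline_error <- sum(cm["TRUE", ]) / n
round(baseline_error, 3)  # 0.293
```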
Exercise 4 analysis
Let’s start by loading the data and looking at its structure.
library(MASS)
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The dataset concerns housing values in different towns of the Boston metropolitan area. It has 506 observations with 14 variables, including the median value of owner-occupied homes in the town (medv), the per capita crime rate of the town (crim), and the proportion of non-retail business acres per town (indus). The descriptions of all the variables can be found here.
Let’s plot a correlation matrix to visualize possible correlations between variables.
#load the libraries
library(corrplot)
library(tidyverse)
#generate and plot correlation matrices
corr_matrix <- cor(Boston)
corr_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
corrplot(corr_matrix, method = "circle", type = "upper", cl.pos = "b", order = "hclust", tl.pos = "d", tl.cex = 0.6)
We can see some strong positive correlations, including tax-rad, tax-indus, nox-indus, and nox-age. Strong negative correlations include dis-nox, dis-indus, dis-age, and medv-lstat.
For example, if we look at nox (nitrogen oxides concentration) and dis (distance to Boston employment centres), we can see a clear relationship.
ggplot(data = Boston, aes(x = nox, y = dis)) + geom_point()
Now, let’s look at the summary.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We can see that there’s quite a bit of range in many of the variables. The median value of owner-occupied homes, for example, varies from $5000 to $50000 (we can see that the dataset is rather old considering how low the prices are).
Let’s standardize the dataset with the scale function.
scaled_data <- scale(Boston)
scaled_data <- as.data.frame(scaled_data)
summary(scaled_data)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
What we’ve done is scale each variable by subtracting the mean from each value and dividing by the standard deviation. This means that the mean of each variable is now 0, and each value tells us how many standard deviations away from the mean it lies.
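The equivalence is easy to check by hand; a quick sketch on a toy vector:

```r
x <- c(2, 4, 6, 8, 10)

# scale() centers by the mean and divides by the standard deviation
scaled_builtin <- as.numeric(scale(x))
scaled_manual <- (x - mean(x)) / sd(x)

all.equal(scaled_builtin, scaled_manual)  # TRUE
mean(scaled_builtin)                      # 0
```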
Now, let’s create a categorical variable of the crime rate.
bins <- quantile(scaled_data$crim)
crime <- cut(scaled_data$crim, breaks = bins, labels = c("low","med_low","med_high","high"), include.lowest = TRUE)
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
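Quantile breaks give (up to ties) equal-sized categories, which is why the four classes above are nearly the same size. A quick sketch of the same cut on a toy vector:

```r
x <- 1:100

# default quantile() returns the min, quartiles and max: 5 break points
bins <- quantile(x)
categories <- cut(x, breaks = bins,
                  labels = c("low", "med_low", "med_high", "high"),
                  include.lowest = TRUE)

table(categories)  # 25 observations per class
```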
Now, let’s replace the old crim variable with the new one.
scaled_data <- dplyr::select(scaled_data, -crim)
scaled_data <- data.frame(scaled_data, crime)
And now let’s split the data into training and test sets.
ind <- sample(nrow(scaled_data), size = nrow(scaled_data)*0.8)
train <- scaled_data[ind,]
test <- scaled_data # note: all rows, including the training rows, end up in the test set
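As a small aside: test <- scaled_data above keeps every row (training rows included) in the test set; the usual pattern keeps only the complement of the sampled indices. A sketch on synthetic row indices:

```r
set.seed(42)
n <- 506  # number of observations

# sample 80% of row indices for training; the rest form the test set
train_idx <- sample(n, size = floor(n * 0.8))
test_idx <- setdiff(seq_len(n), train_idx)

length(train_idx)  # 404
length(test_idx)   # 102
```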
Let’s fit a linear discriminant analysis (LDA) on the training set, with our new categorical crime variable as the target variable.
lda_fit <- lda(crime ~ ., data = train)
Now let’s visualize the LDA with a biplot.
lda.arrows <- function(x, myscale = 1, arrow.heads = 0.1, color = 'purple', tex = 0.75, choices = c(1,2)) {
heads = coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale*heads[,choices[1]],
y1 = myscale*heads[,choices[2]],
col = color, length = arrow.heads)
text(myscale*heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}
classes = as.numeric(train$crime)
plot(lda_fit, dimen = 2, col = classes)
lda.arrows(lda_fit, myscale = 2)
First, let’s save and remove the crime variable from the test set.
correct <- test$crime
test <- dplyr::select(test, -crime)
Now, let’s predict
predictions <- predict(lda_fit, newdata = test)
t <- table(correct = correct, predicted = predictions$class)
t
## predicted
## correct low med_low med_high high
## low 98 25 4 0
## med_low 31 73 22 0
## med_high 3 29 88 6
## high 0 0 1 126
The high class is predicted very well, but low is quite often mistaken for med_low, and med_high for med_low. Still, the model seems quite reasonable; almost all of the errors are only one category away from the real class.
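The overall accuracy can be computed from the diagonal of the table above:

```r
# confusion matrix from the LDA predictions (rows: correct class)
cm <- rbind(low      = c(98, 25,  4,   0),
            med_low  = c(31, 73, 22,   0),
            med_high = c( 3, 29, 88,   6),
            high     = c( 0,  0,  1, 126))

accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 3)  # ~0.761
```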
Let’s reload the dataset, standardize it and calculate the distances between observations.
data(Boston)
scaled_data <- scale(Boston)
dist <- dist(scaled_data)
Let’s first just run a k-means algorithm with 3 clusters.
km <-kmeans(scaled_data, centers = 3)
pairs(scaled_data, col = km$cluster)
Hmm… can’t make too much sense out of the clusters with this visualization. Let’s get back to this later.
Now, let’s try to figure out how many clusters we should optimally use by looking at how WCSS (within cluster sum of squares) behaves when we change the number of clusters.
library(ggplot2)
set.seed(123)
k_max <- 20
twcss <- sapply(1:k_max, function(k){kmeans(scaled_data, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
Based on the big drop around 2, let’s use 2 as our number of clusters.
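The same elbow heuristic can be sketched on any numeric data; here on the built-in iris measurements as a stand-in (an assumption: the Boston data behaves analogously):

```r
set.seed(123)
data <- scale(iris[, 1:4])  # numeric columns only, standardized

k_max <- 10
twcss <- sapply(1:k_max, function(k) {
  # nstart = 10 reruns k-means from several starts for stability
  kmeans(data, centers = k, nstart = 10)$tot.withinss
})

# the total within-cluster sum of squares shrinks as k grows;
# the "elbow" (sharpest drop) suggests a reasonable k
round(twcss[1:4], 1)
```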
Now let’s run our k-means algorithm.
km <-kmeans(scaled_data, centers = 2)
scaled_data <- as.data.frame(cbind(scaled_data, km$cluster))
Now let’s visualize the clusters.
pairs(scaled_data, col = km$cluster)
It’s a little hard to make out. Let’s try another way to visualize the clusters.
library(GGally)
scaled_data <- data.frame(scaled_data)
ggpairs(scaled_data, aes(col = as.factor(km$cluster), alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
The clusters appear rather distinct now. The blue cluster seems to mostly include towns that have more industry, better highway access, higher taxes, worse air quality (nox), a lower-status population, shorter distances to the Boston employment centers, and lower-valued homes.
Exercise 5 analysis
Let’s start by loading the data and looking at its structure.
human <- read.csv('./data/human.csv', row.names = 1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
The data contains 155 rows with 8 variables. Each row describes a single country’s life expectancy (Life.Exp), maternal mortality rate (Mat.Mor), expected years of schooling (Edu.Exp), gross national income per capita (GNI), adolescent birth rate (Ado.Birth), proportion of women in parliament (Parli.F), female/male ratio in the labour force (Labo.FM) and female/male ratio of those who have attained secondary education (Edu2.FM).
Next, we’ll print out the summaries of the variables.
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Let’s look at some visualizations.
ggpairs(human)
cor(human) %>% corrplot(method = "circle", type = "upper", cl.pos = "b", order = "hclust", tl.pos = "d", tl.cex = 0.6)
In terms of distributions, it’s clear that many of the variables do not follow a normal distribution. For example, GNI is heavily right-skewed (most countries sit at the low end, with a long tail of high values).
In terms of relationships, we can see correlations between many of the variables. For example, maternal mortality has a negative correlation with Life.Exp, Edu.Exp and Edu2.FM and a positive correlation with Ado.Birth. Life expectancy, on the other hand, has a positive correlation with Edu.Exp, GNI and Edu2.FM.
Let’s perform a PCA on the non-standardized dataset and do a bi-plot for the first two principal components.
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
We can see that effectively all (99.99%) of the variance in the data is explained by PC1, which is almost entirely driven by GNI since its raw scale dwarfs that of the other variables. This is not very useful. Let’s see if we can do better by standardizing the data.
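This scale sensitivity is easy to reproduce with synthetic data; a sketch where one made-up variable (income) has a much larger scale than the other (ratio):

```r
set.seed(1)
income <- rnorm(100, mean = 20000, sd = 15000)  # large-scale variable
ratio  <- rnorm(100, mean = 0.8, sd = 0.2)      # small-scale variable
d <- data.frame(income, ratio)

# unstandardized: PC1 is essentially just income
pca_raw <- prcomp(d)
var_share_raw <- pca_raw$sdev[1]^2 / sum(pca_raw$sdev^2)

# standardized: the variables contribute on equal footing
pca_std <- prcomp(d, scale. = TRUE)
var_share_std <- pca_std$sdev[1]^2 / sum(pca_std$sdev^2)

round(c(raw = var_share_raw, std = var_share_std), 3)
```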
First, let’s standardize all the variables.
human_scaled <- scale(human)
Now let’s redo the PCA and the biplot.
pca_human_std <- prcomp(human_scaled)
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
When looking at the first two principal components, we can see that higher Edu.Exp, GNI, Edu2.FM and Life.Exp drive PC1 to the left, whereas higher Mat.Mor and Ado.Birth drive it to the right. PC2 is driven mainly by Parli.F and Labo.FM.
Based on this, we can see that richer and more developed countries (countries with higher GNI, education and life expectancy) tend to be situated on the left and poorer ones on the right. Countries with higher female representation in parliament and/or participation in the labour force trend upwards, and countries with lower female labour participation and parliamentary representation trend downwards. For example, Iceland, which is both highly developed and has high female participation in the workforce and parliament, is in the upper left, whereas Qatar, which is rich but has low female representation in parliament and the workforce, is in the lower right.
Let’s load and look at the dataset
library(FactoMineR)
library(tidyr)
data(tea)
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
There are a lot of variables, so let’s keep only the ones we are interested in (the same ones as in the DataCamp exercise).
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_data <- dplyr::select(tea, one_of(keep_columns))
dim(tea_data)
## [1] 300 6
str(tea_data)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
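Before running the MCA, it can be useful to cross-tabulate a couple of the kept variables to get a first look at their association. A minimal base-R sketch (assuming the `tea` dataset that ships with FactoMineR):

```r
# Minimal sketch (base R only): cross-tabulate two of the kept variables
# to preview their association before running MCA.
library(FactoMineR)  # provides the tea dataset
data(tea)
tea_data <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]

# Counts of purchase form (tea bag / unpackaged) by purchase place
table(tea_data$how, tea_data$where)
```

If the counts concentrate on the diagonal (tea bags with chain stores, unpackaged tea with tea shops), we should expect those categories to land close together in the MCA plot.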
gather(tea_data) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
(The warning is harmless: `gather()` coerces the factors to character, which does not matter for a bar plot.) The dataset describes participants’ answers to a questionnaire on tea. After the selection there are 300 rows and 6 variables. Next, we’ll run the MCA and visualize the results.
mca <- MCA(tea_data, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_data, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")
Color indicates which variable each category belongs to, and the distance between points reflects their similarity: categories that are close together tend to occur in the same individuals. For example, those who buy unpackaged tea also tend to buy from tea shops, whereas those who use tea bags tend to buy from chain stores.
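To judge how much of the total variation the plotted dimensions actually capture, the eigenvalues stored in the MCA object can be inspected directly. A hedged sketch (refitting the same model as above so the chunk is self-contained):

```r
# Sketch: inspect the variance explained by each MCA dimension.
library(FactoMineR)
data(tea)
tea_data <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_data, graph = FALSE)

# mca$eig holds the eigenvalue, % of variance and cumulative %
# for each dimension, one row per dimension
head(mca$eig)

# Simple scree-style bar plot of the percentages (column 2)
barplot(mca$eig[, 2], names.arg = rownames(mca$eig),
        las = 2, ylab = "% of variance")
```

As the summary above shows, the first two dimensions cover only about 29% of the variance, so the biplot should be read as a rough overview rather than a complete picture of the data.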
(more chapters to be added similarly as we proceed with the course!)